# install janitor
install.packages("janitor")
# install skimr
install.packages("skimr")

16 Exploratory Analysis I
Learning Objectives
After completing this tutorial you should be able to:
- identify specific steps to explore a new data set
- formulate a question and determine if your data set is appropriate to answer it
- refine questions based on data contained in a data set
- outline a strategy for data exploration using graphs
Download the directory for this project here, make sure the directory is unzipped, and move it to your bi328 directory. You can open the Rproj for this module either by double clicking on it (which will launch RStudio), by opening RStudio and then using File > Open Project, or by clicking on the R project icon in the top right of your program window and selecting Open Project.
There should be a file named 16_exploratory-analysis-i.qmd in that project directory. Use that file to work through this tutorial - you will hand in your rendered ("knitted") Quarto file as your homework assignment. So, first thing, change the author in the YAML header to your name. You will use this Quarto document to record your answers. Remember to use comments to annotate your code; at minimum you should have one comment per code set1, though you may of course add as many comments as you need to be able to recall what you did. Similarly, take notes in the document as we discuss the discussion/reflection questions, but make sure that you go back and clean them up for "public consumption".
1 You should do this whether you are adding code yourself or using code from our manual, even if it isn't commented in the manual. Especially when the code is already included for you, add comments describing how each function works/what it does as we introduce it during the participatory coding session, so you can refer back to it later.
Let’s make sure to read in the libraries we need for this analysis.
Now we can load all of our libraries and get started.
# load libraries
# reporting
library(knitr)
# visualization
library(ggplot2)
library(ggthemes)
library(patchwork)
# data wrangling
library(dplyr)
library(tidyr)
library(readr)
library(skimr)
library(janitor)
library(magrittr)
# turn off scientific notation
options(scipen=999)

16.1 Exploratory Analysis
An exploratory analysis is a systematic way of exploring a data set, primarily through visualizations. Generally, it is an iterative process that starts with a question or hypothesis; most of your exploratory analysis then consists of visualizing, transforming, and even modeling your data to answer that question. Finally, you summarize your results and conclusions and use those answers to refine your original question or generate new questions. Frequently, an exploratory analysis is the first step before deciding what your formal analysis will look like, and it can serve to generate hypotheses for further exploration.
Part of an exploratory analysis is quality control of your data: determining if it meets your expectations and cleaning it up as needed. Frequently it also includes transforming the existing raw data into parameters that are more useful for your assessment.
16.2 Step 1: Formulate a question
Recall from our visualization of climate change indicators that we left the question of whether or not hurricanes are going "to get worse" unresolved, even after comparing changes in the number of named storms, the number of major hurricanes, and their intensity.
Let’s say we wanted to explore whether or not natural disasters have gotten worse as climate change has progressed. To do this we can turn to the International Disaster Database.
16.3 Read & Wrangle data set
You should have a copy of the tab-delimited file in your data folder. It contains an additional 6 lines at the top that have information on when and where it was downloaded.
disaster <- read_delim("data/nat-disasters_emdat-query-2021-10-05.txt", delim = "\t", skip = 6)

Our first step always is to take a look at the data set and make sure that it has read in correctly (i.e. that the columns are correctly separated, numeric columns are numeric, character columns are character, etc.) and is in a usable format.
Even though the data has read in correctly, that does not necessarily mean it is in the most usable format ... remember, plotting and filtering data sets requires typing variable names over and over again. Therefore, we want to make sure that everything is set up in a way that helps us avoid mistakes that are mostly due to typos.
For example, let’s take a look at the column names:
colnames(disaster)

[1] "Dis No" "Year"
[3] "Seq" "Glide"
[5] "Disaster Group" "Disaster Subgroup"
[7] "Disaster Type" "Disaster Subtype"
[9] "Disaster Subsubtype" "Event Name"
[11] "Country" "ISO"
[13] "Region" "Continent"
[15] "Location" "Origin"
[17] "Associated Dis" "Associated Dis2"
[19] "OFDA Response" "Appeal"
[21] "Declaration" "Aid Contribution"
[23] "Dis Mag Value" "Dis Mag Scale"
[25] "Latitude" "Longitude"
[27] "Local Time" "River Basin"
[29] "Start Year" "Start Month"
[31] "Start Day" "End Year"
[33] "End Month" "End Day"
[35] "Total Deaths" "No Injured"
[37] "No Affected" "No Homeless"
[39] "Total Affected" "Reconstruction Costs ('000 US$)"
[41] "Insured Damages ('000 US$)" "Total Damages ('000 US$)"
[43] "CPI" "Adm Level"
[45] "Admin1 Code" "Admin2 Code"
[47] "Geo Locations"
Apparently, whoever compiled this data set has not read our guidelines for naming things. Column headers with special characters and spaces are super annoying; you might recall from our previous adventures in data wrangling that every time we want to call a column whose name contains spaces we have to use backticks to tell R that it isn't separate words but a single name.
Now is probably a good time to learn how to rename columns. The most straightforward way is using rename().
Let's say we wanted to rename Dis No to DisNo; the syntax is simply NewName = OldName.
disaster %>%
rename(DisNo = `Dis No`) %>%
head()

# A tibble: 6 × 47
DisNo Year Seq Glide `Disaster Group` `Disaster Subgroup` `Disaster Type`
<chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 1900-9… 1900 9002 <NA> Natural Climatological Drought
2 1900-9… 1900 9001 <NA> Natural Climatological Drought
3 1902-0… 1902 0012 <NA> Natural Geophysical Earthquake
4 1902-0… 1902 0003 <NA> Natural Geophysical Volcanic activ…
5 1902-0… 1902 0010 <NA> Natural Geophysical Volcanic activ…
6 1903-0… 1903 0006 <NA> Natural Geophysical Mass movement …
# ℹ 40 more variables: `Disaster Subtype` <chr>, `Disaster Subsubtype` <chr>,
# `Event Name` <chr>, Country <chr>, ISO <chr>, Region <chr>,
# Continent <chr>, Location <chr>, Origin <chr>, `Associated Dis` <chr>,
# `Associated Dis2` <chr>, `OFDA Response` <chr>, Appeal <chr>,
# Declaration <chr>, `Aid Contribution` <dbl>, `Dis Mag Value` <dbl>,
# `Dis Mag Scale` <chr>, Latitude <chr>, Longitude <chr>, `Local Time` <chr>,
# `River Basin` <chr>, `Start Year` <dbl>, `Start Month` <dbl>, …
This would be pretty annoying if we wanted to rename all of our columns by hand.
Fortunately, we can leverage the janitor::clean_names() function2, which is designed to parse letter cases and separators - the default is to convert names to snake_case, i.e. all names are lowercase and words are separated by underscores. It also deals with duplicate names and special characters.
2 The notation you see here specifies the R package and then the function within it, as package::function().
Let’s see what this does for us:
disaster <- read_delim("data/nat-disasters_emdat-query-2021-10-05.txt", delim = "\t", skip = 6) %>%
clean_names()
colnames(disaster)

[1] "dis_no" "year"
[3] "seq" "glide"
[5] "disaster_group" "disaster_subgroup"
[7] "disaster_type" "disaster_subtype"
[9] "disaster_subsubtype" "event_name"
[11] "country" "iso"
[13] "region" "continent"
[15] "location" "origin"
[17] "associated_dis" "associated_dis2"
[19] "ofda_response" "appeal"
[21] "declaration" "aid_contribution"
[23] "dis_mag_value" "dis_mag_scale"
[25] "latitude" "longitude"
[27] "local_time" "river_basin"
[29] "start_year" "start_month"
[31] "start_day" "end_year"
[33] "end_month" "end_day"
[35] "total_deaths" "no_injured"
[37] "no_affected" "no_homeless"
[39] "total_affected" "reconstruction_costs_000_us"
[41] "insured_damages_000_us" "total_damages_000_us"
[43] "cpi" "adm_level"
[45] "admin1_code" "admin2_code"
[47] "geo_locations"
That looks much better.
16.4 Step 2: Get an overview of the data set
Our next step always is to determine what information is contained in the data set and how it is organized.
First, we will need to understand what information is contained in each column. Some of the column headers are more self-explanatory than others. One of the most helpful documents at this point is the metadata which will tell us what information is in each column. A good place to look for that is the data portal where you accessed the data set itself.
Now, we can take a look at some of the specifics of how the information is organized in the data set and specifics on the information contained.
Next to all of the functions you just listed, another helpful function is skimr::skim()3.
3 You may want to adjust the width of your console pane to get a better look at the output.
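The call itself is simple (a sketch, assuming the disaster object read in above):

```r
# summarize every column: type, number of missing values, number of unique
# entries for character columns, and mean/sd/percentiles plus a mini
# histogram for numeric columns
skim(disaster)
```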
For numerical columns, mean values and percentiles can give you an overview of how the data is distributed. However, for columns that contain strings or characters we cannot produce those types of summary statistics. Rather, we are probably more interested in how many unique values those columns contain and what those unique entries are.
skim() already gave us the number of unique entries for each column; we have two options we can use to determine what those values are.
4 These should probably be categorical values.
The first is unique(). This function makes use of the fact that each column is a vector5.
unique(disaster$disaster_subgroup)

[1] "Climatological" "Geophysical" "Meteorological"
[4] "Hydrological" "Biological" "Extra-terrestrial"
When you run this function it returns a vector; while this might be helpful in some contexts, it would also be helpful if we could view it in a tabular format. Do not fret - the tidyverse has you covered.
disaster %>%
distinct(disaster_subgroup)

# A tibble: 6 × 1
disaster_subgroup
<chr>
1 Climatological
2 Geophysical
3 Meteorological
4 Hydrological
5 Biological
6 Extra-terrestrial
5 Remember, we can access each column of a dataframe as `df$columnname`.
16.5 Step 3: Choose a question to explore
Frequently you will have a general question which is probably why you pulled a data set in the first place. In our example that would be something along the lines of “Are natural disasters getting worse with climate change?”.
Usually, there will be more than one way to go after that question, and you will likely want and need to do some extended exploration of the data set to generate additional questions (hypotheses) and ultimately find the one you are most interested in.
In this case, we could focus on whether we are indeed observing an increase in the frequency of events, and/or we could compare whether groups of natural disasters that are more dependent on climate/weather conditions (storms, hurricanes) are increasing more rapidly than categories that have other origins (e.g. earthquakes or volcanic eruptions).
Or, we could have already done some analysis and/or background reading indicating that natural disasters linked to climate patterns are indeed predicted to get worse based on modeling. In that case we might be more interested in whether the impact of natural disasters on humans is getting increasingly worse. We could be at a point where at least some of the changes to the climate system are inevitable, in which case questions of mitigation and preparedness, like early warning systems or infrastructure, become more important. We cannot stop the events from happening, but to an extent we can control how well we deal with them.
Exploratory analysis is fundamentally a very creative process, and thinking outside the box can be what sets your analysis apart from what everyone else is doing. However, before we can start thinking outside the box we need to first figure out what "the box" is and use that as our starting point. Unfortunately, this does mean you should always start with the boring standard stuff, cover your bases, and then work from there towards something more interesting. However, not infrequently the "boring" stuff is the most revealing!
16.6 Step 4: Plot the data!
While tables are sometimes more appropriate, most exploratory analysis involves generating plots that help you get an overview of the data in relation to your question.
16.6.1 Explore the distribution of the data
A good place to start is to get an idea of the variation or variance of the data you are interested in.
For any of these assessments you should always ask the following questions:
- Which values are more common than others? Why?
- Which values are more rare than others? Why?
- Are there distinct clusters that emerge? Why does/doesn’t that make sense?
- Do overall patterns meet my expectations? What could cause deviations? Why do I have these expectations of the data?
- Are there patterns that stand out? Are they unusual/unexpected? What could be causing them?
In each case, as you pick up on patterns you want to also ask “why” and “how come” questions6 … those are the ones that will lead you to the next steps of your exploration.
6 Imagine a cute toddler right next to you asking “why” at every step.
Typically, you should always explore outlier values to determine whether they are genuine observations or errors that it is legitimate to remove, along with patterns of missing data. Keep in mind that too much missing data could be an indication that you should be looking for a better data set.
Let’s get started. You will hopefully quickly see that for any type of exploratory analysis you start with one plot/question to visualize a pattern and then typically go through a few steps of plotting that same data set component in a few different ways to get a view from a few different angles before moving on to the next thing. Be patient and try various things even if some things you pursue end up being dead ends.
Let’s start by figuring out how many observations fall into each sub-category of natural disasters7.
7 We finally get to see the alternative to stat = "identity" when using bar plots! To plot the distribution of a categorical variable you are now telling ggplot to count the number of observations for each category.
ggplot(disaster, aes(x = disaster_subgroup)) +
geom_bar(stat = "count")

We could do the same for our disaster types.
ggplot(disaster, aes(x = disaster_type)) +
geom_bar(stat = "count")

We probably want to be able to read those labels, so let's rotate them by 90 degrees.
ggplot(disaster, aes(x = disaster_type)) +
geom_bar(stat = "count") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

This gives us global numbers; we might want to break it up by region.
ggplot(disaster, aes(x = disaster_type)) +
geom_bar(stat = "count") +
facet_wrap(. ~ region) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Hm, that's interesting.
The next step is that our question is connected to the "getting worse" component, so we have to consider ways we can measure "worseness". One way to do that is impact. In this case, we would want to filter our data set to include only events for which damages were assessed.
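The exact wrangling happens during the participatory coding session, but a minimal sketch might look like this (the object name damages matches its use further below):

```r
# keep only events where damages were actually assessed
damages <- disaster %>%
  filter(!is.na(total_damages_000_us))

# first look at the distribution of assessed damages
ggplot(damages, aes(x = total_damages_000_us)) +
  geom_histogram()
```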
Ugh, that plot does not initially look too helpful; it seems like almost all of our values fall into the same bin. This could be because we have some outliers that are skewing our histogram.
We can get a better idea by using e.g. a square root scale, which skews the y-axis in a systematic way8.
8 This is the type of thing where, in order to not unintentionally mislead, you would always want to add "Note the transformed y-scale".
ggplot(damages, aes(x = total_damages_000_us)) +
geom_histogram() +
scale_y_sqrt()

Based on our second figure, it looks like we might want to limit our assessment to droughts, earthquakes, wildfires, floods, and storms. Using boxplots instead of histograms will also make these comparisons easier.
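One way this comparison might be sketched (the subset name focus is just an illustrative choice):

```r
# hypothetical subset limited to the five disaster types singled out above
focus <- damages %>%
  filter(disaster_type %in% c("Drought", "Earthquake", "Wildfire", "Flood", "Storm"))

# boxplots make the center and spread of each group easier to compare;
# note the transformed y-scale!
ggplot(focus, aes(x = disaster_type, y = total_damages_000_us)) +
  geom_boxplot() +
  scale_y_sqrt()
```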
From your figure, you've probably seen that the big differences seem to be in the extreme values in each group, but it is difficult to tell how different the mean, median, and general spread of the data are.
16.7 Step 5: Explore relationships among variables
After you've explored patterns within the variables you are interested in, you are also going to want to explore relationships (covariation) among variables9.
9 Remember, when we start looking at correlation, we always need to be careful about how/when we infer causation!
Are damages incurred by natural disasters increasing over time?
We have already spent a little bit of time looking at more than one variable by looking at subgroups within our data set, specifically the disaster types and regional differences.
However, given the initial question we are asking of the data set, the relationship we are really interested in is the change over time. This means we are technically comparing two continuous variables, but in this specific case we frequently describe it as a time series.
Let’s take a look at the temporal patterns we can observe.
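A basic sketch of that first look, plotting each event's assessed damages against its year (again, note the transformed y-scale):

```r
# damages per event over time
ggplot(damages, aes(x = year, y = total_damages_000_us)) +
  geom_point() +
  scale_y_sqrt()
```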
That’s helpful, but from this figure it is quite difficult to tell whether this is a consistent pattern, or if it is one that is driven for example by a specific type of disaster.
We could enhance this figure and our understanding of the data set by e.g. color coding by disaster type.
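For example, building on the previous scatterplot (a sketch):

```r
# color-code individual events by disaster type
ggplot(damages, aes(x = year, y = total_damages_000_us, color = disaster_type)) +
  geom_point() +
  scale_y_sqrt()
```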
While that gives us a bit of a better idea of the pattern in the data, it is still difficult to tell if there are differences between the disaster types - this also does not let us parse whether this is a consistent pattern across the globe or if it is for example, driven by specific geographic regions.
Is the magnitude of the disaster correlated to the damages it incurs?
Another important relationship to explore is whether the damages incurred by a given natural disaster scales with its magnitude. Both frequency and intensity of climate-related natural disasters are expected to increase so this is an important relationship to understand.
10 Check out the layout guidelines for patchwork if you need a refresher on how to combine plots.
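A sketch of this comparison might look like the following. Keep in mind that dis_mag_value is reported on a different scale for each disaster type (e.g. earthquake magnitude vs. wind speed), so comparisons are only meaningful within a type; patchwork lets us put the linear and square-root views side by side:

```r
# magnitude vs. damages on a linear y-scale
p_linear <- ggplot(damages, aes(x = dis_mag_value, y = total_damages_000_us,
                                color = disaster_type)) +
  geom_point()

# the same plot with a square-root y-scale to de-emphasize outliers
p_sqrt <- p_linear + scale_y_sqrt()

# stack the two views with patchwork
p_linear / p_sqrt
```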
So far we haven't really found any patterns that pop out - though keep in mind that "non-patterns" are also noteworthy in and of themselves11.
11 Recall that we've discussed the importance of creating hypotheses that are falsifiable. Even if we think there should be certain relationships, during an exploratory analysis we always need to be open to the idea that certain relationships are not in the data, and we cannot try to manipulate the data to make it seem as if they were.
Can we create metrics that help us better understand important relationships?
We do know from our previous plots that we tend to have a pretty tight distribution, with damages being pretty similar across events. However, you have probably also noticed that there are always individual events that are outliers in terms of how much damage they incur.
Since we are looking at very broad-scale patterns, it might make sense to transform our data in order to get a new metric based on the data at hand that enables a more helpful and informative comparison.
In this case, it could be helpful to calculate the total costs incurred per year for each disaster type and then compare how those total values are changing over time.
Let’s start by creating a new data frame that contains information on total damages incurred per year.
total_yr <- disaster %>%
filter(!is.na(total_damages_000_us)) %>%
group_by(year, disaster_type) %>%
summarize(total_damages_yr = sum(total_damages_000_us)/1000000000)

Let's plot that data set.
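One way to plot the new per-year totals, faceted by disaster type (a sketch):

```r
# total damages per year for each disaster type
ggplot(total_yr, aes(x = year, y = total_damages_yr)) +
  geom_line() +
  facet_wrap(. ~ disaster_type) +
  labs(y = "Total damages per year (billion US$)")
```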
Now we’re getting somewhere.
As we are looking at these plots, it is important to keep in mind that an increase in total damages per year can be due to more individual events that each incur a small amount of damage, a higher incidence of high-cost events, or, of course, a combination of the two.
Let's create a data set with a limited set of disaster types so we can elucidate patterns more easily, and while we're at it, also calculate the frequency of each disaster type per year.
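A sketch of that wrangling step (the object and column names focus_yr and n_events are illustrative choices, not fixed by the data set):

```r
# per-year summary for a limited set of disaster types:
# event counts (frequency) and total damages in billion US$
focus_yr <- disaster %>%
  filter(!is.na(total_damages_000_us),
         disaster_type %in% c("Drought", "Earthquake", "Wildfire", "Flood", "Storm")) %>%
  group_by(year, disaster_type) %>%
  summarize(n_events = n(),
            total_damages_yr = sum(total_damages_000_us)/1000000000)
```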
Now that we’ve discovered a better metric to be looking at, let’s quickly check what the relationship is of frequency and total damages.
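Assuming the per-year summary described above is stored in a data frame focus_yr with event counts in n_events and damages in total_damages_yr (the names are assumptions), the check might be sketched as:

```r
# does the number of events per year track the total damages per year?
ggplot(focus_yr, aes(x = n_events, y = total_damages_yr, color = disaster_type)) +
  geom_point() +
  scale_y_sqrt()
```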
Based on those last two figures, the number of events definitely appears to play a role in the total damages per year due to the various disaster types, though we would need a more formal analysis to disentangle those effects.
Pick the visualizations that best communicate the pattern you want to highlight
Our multi-panel time series seems to have the data transformed in a way that will give us an answer to our question - but it's not really a great visualization if we wanted to share results from our exploratory analysis. Generally, you'll hit a point where you know you have the data transformed in a useful way and you can start playing around with optimizing the presentation of that data set.
At this point we are shifting from our exploratory figures to an explanatory figure that we might include in a report or an update to our collaborators.
One of the problems with our multi-panel visualization is that the scales for all of the y-axes are the same because we have some outliers that compress the scale. Maybe a traditional scatterplot is not ideal. Because we are interested in changes in magnitude, we could try a less traditional plot, like a bubble chart. Instead of plotting the total damages incurred on the y-axis, we can plot our disaster types and then adjust the size of each bubble according to the damages incurred.
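A first version of such a bubble chart might be sketched as:

```r
# year on x, disaster type on y, bubble size scaled to damages per year
ggplot(total_yr, aes(x = year, y = disaster_type, size = total_damages_yr)) +
  geom_point(alpha = 0.5)
```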
Now we’re cooking!
Let's think of some ways we can refine this plot. We have overlapping points, and we might also want additional bin sizes.
Let’s see if we can fix our legend to something more visible, we might want different colors, and we’ll need to add axis labels, plot titles etc.
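One possible refinement; the legend breaks, colors, and labels here are illustrative choices, not the one "right" answer:

```r
ggplot(total_yr, aes(x = year, y = disaster_type,
                     size = total_damages_yr, color = disaster_type)) +
  geom_point(alpha = 0.5) +
  # larger legend breaks make the size scale easier to read (assumed values)
  scale_size_continuous(breaks = c(1, 10, 100, 500)) +
  # color is redundant with the y-axis, so drop its legend
  guides(color = "none") +
  labs(x = "Year", y = NULL,
       size = "Total damages\n(billion US$)",
       title = "Damages incurred by natural disasters over time") +
  theme_classic()
```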
Can we infer/predict relationships between our sample and the whole population?
A final step in our exploratory analysis is starting to ask whether patterns are specific to the data set (descriptive analysis) or whether there are true non-random relationships between variables (inferential analysis). The overview of the patterns in the data that we get through exploratory analysis is really important to help us determine which questions/hypotheses are worth exploring more in depth.
Typical questions we might further explore at this point include e.g.
- Is this pattern due to random chance or does the independent variable have significant explanatory power?
- How can the relationship implied by the pattern be described (e.g. linear regression)?
- How strong is the relationship (R2 value)?
- What other variables might affect or explain the relationship (e.g. are these variable correlated because they are both correlated to another variable that is responsible for the mechanism?)
- Does this relationship hold across subgroups of the data?
Fitting trend lines helps elucidate patterns (especially in time series). This is generally the point where exploratory analysis transitions into a more formal analysis that involves statistical tests and fitting models. Keep in mind that we want a good grasp on whether our analysis is descriptive, inferential/predictive, or causal/mechanistic in nature. Exploratory analysis is "quick and dirty": you are mainly focused on summarizing the data and highlighting the broad-scale patterns that emerge. It is useful for assessing the quality of your data, eyeballing whether patterns are what you expect (was your hypothesis correct(ish)?), and getting an idea of what the next steps in a more formal analysis should be.
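As a sketch of the trend-line idea, a linear fit could be layered onto the per-year totals (a quick eyeball, not a formal model):

```r
# overlay a linear trend on total damages per year, one panel per type
ggplot(total_yr, aes(x = year, y = total_damages_yr)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(. ~ disaster_type)
```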
16.8 Step 6: Iterate your exploration
Now that we have an answer(ish) to our question - the first thing we should do is question our answer12. Always consider limitations of your data, whether there are alternative explanations, if your data has some inherent issues you need to be aware of etc. It can also be helpful to find at least one external data source to assess whether your answers are roughly in line with other (expected) measurements.
12 especially if your answer conforms to your a priori expectations!
It is called an exploratory analysis for a reason, because it is just the beginning. Usually this is the beginning of a more sophisticated analysis. Important questions to ask always include
- Is your data appropriate for the question you are asking?
- Do you need additional data to refine your answer?
- Do you have the right question or do you need to refine your question?
Exploratory analysis is frequently more about hypothesis generation than hypothesis testing. So your data exploration can be an important step in defining the precise (set of) question(s) you want to explore. Not infrequently, the exploration of an initial question leads you to generate additional (perhaps more interesting) questions.
So, for example, one of the first questions we might ask is whether the pattern of increasingly higher damages holds if we compare it to loss of life.
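A sketch of that follow-up, summarizing deaths per year analogously to the damages metric (the names deaths_yr and total_deaths_yr are illustrative):

```r
# total deaths per year for each disaster type
deaths_yr <- disaster %>%
  filter(!is.na(total_deaths)) %>%
  group_by(year, disaster_type) %>%
  summarize(total_deaths_yr = sum(total_deaths))

# note the transformed y-scale!
ggplot(deaths_yr, aes(x = year, y = total_deaths_yr)) +
  geom_line() +
  facet_wrap(. ~ disaster_type) +
  scale_y_sqrt()
```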
13 A good way to do this is to briefly describe each figure to summarize the broad pattern(s), then point out what they have in common and what sets them apart
Additionally, we could break down regional patterns, or even pull in additional data on the GDP of individual countries to compare the costs incurred scaled to the economies of different countries.
It’s always helpful to take notes as you go to jot down questions and also ideas on what might explain certain patterns you see.